Gender Classification via Voice

Jake Whalen

CS 584 Final Project
Fall 2017

Summary

Choosing a Project

  • Topic: sports, beer, or something else?
  • Supervised or unsupervised learning?
  • Data source: direct download, web scraping, or social media?
  • Tools: Python, R, Weka, Tableau, or Excel?

Choice

  • Data from Kaggle
  • Audio Analysis
  • Supervised Learning
  • Classification
  • ML in Python
  • Presentation & Report in R Markdown
  • Excel for results transfer

Goals

  • Classify the gender of each audio clip's speaker
  • Learn which features best separate gender in audio
  • Look for other potential clusters within the data

Method

Exploration

  1. Read the data into R/Python
  2. Ran summary functions on the features
  3. Plotted the data
  4. Looked for patterns and relationships between features
  5. Determined which features separate gender best
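
The exploration steps above can be sketched in pandas; the tiny DataFrame here is a hypothetical stand-in for the real Kaggle file, and the feature values are illustrative only:

```python
import pandas as pd

# Hypothetical stand-in for the Kaggle voice data; the real dataset has
# 3,168 rows, 21 acoustic features, and a "label" column.
df = pd.DataFrame({
    "meanfun": [0.08, 0.11, 0.17, 0.19],   # mean fundamental frequency
    "Q25":     [0.15, 0.19, 0.21, 0.24],   # first quantile of frequency
    "label":   ["male", "male", "female", "female"],
})

# Step 2: summary statistics for each feature
print(df.describe())

# Steps 4-5: compare per-class means to see which features separate gender
print(df.groupby("label").mean(numeric_only=True))
```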

Classification

  1. Used scikit-learn in Python
  2. Split the data for training/testing (1/3, 2/3)
  3. Used grid search to identify the best parameters
  4. KNN (K-Nearest Neighbors)
  5. Decision Tree (DT)
  6. Support Vector Machine (SVM)
  7. Observed prediction outcomes; there was room to improve
  8. Attempted to improve on the initial results:
  9. KNN: transformed the data with PCA
  10. Decision Tree: used multiple trees with Random Forest
  11. SVM: transformed the data with PCA
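
A minimal scikit-learn sketch of steps 2 and 3, using synthetic data in place of the voice features; the KNN grid values mirror the parameters discussed later, but the exact grid is an assumption here:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the voice data (real data: 3,168 samples, 21 features)
X, y = make_classification(n_samples=300, n_features=21, random_state=0)

# Step 2: hold out one third of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=0)

# Step 3: cross-validated grid search over candidate parameters
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 11],
                "p": [1, 2],                        # 1 = Manhattan, 2 = Euclidean
                "weights": ["uniform", "distance"]},
    cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.score(X_test, y_test))   # step 7: observe prediction outcomes
```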

Review

  • Confusion Matrix
  • Overall Accuracy Scores
  • Male Accuracy
  • Female Accuracy
  • Parameter Influence
  • Graph Results
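
The review metrics above can be read off a scikit-learn confusion matrix; the labels below are hypothetical (0 = female, 1 = male):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true/predicted labels: 0 = female, 1 = male
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]

cm = confusion_matrix(y_true, y_pred)   # rows = true class, cols = predicted
overall = accuracy_score(y_true, y_pred)

# Per-class accuracy (recall) from the confusion-matrix rows
female_acc = cm[0, 0] / cm[0].sum()
male_acc = cm[1, 1] / cm[1].sum()
print(cm)
print(overall, male_acc, female_acc)
```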

Overview

Description

Dataset Comments
  • The dataset was created to identify a voice as male or female based on acoustic properties of the voice and speech.
  • It consists of 3,168 recorded voice samples collected from male and female speakers.
  • The voice samples were pre-processed by acoustic analysis in R using the seewave and tuneR packages, with an analyzed frequency range of 0 Hz to 280 Hz (the human vocal range).
  • Each sample is represented by 21 features.
  • Source: Voice Gender Data

Definitions

Sample

EDA

Classes

Distributions

Boxplots

T Test

Heatmap

Scatter Plot

3D Plot

KNN

K-Nearest Neighbors

Summary
  • Used untransformed data
  • Better than a naive baseline classifier
  • Manhattan distance produced better CV results (p = 1)
  • Distance weights outperformed uniform weights
  • Better at identifying males
Best Parameters
  • algorithm = auto; n_neighbors = 11; p = 1; weights = distance

Confusion Matrix

Results

Decision Tree

Decision Tree

Summary
  • Used untransformed data
  • meanfun & Q25 account for over 75% of total feature importance
  • Better at identifying males
  • Tree
Best Parameters
  • criterion = entropy; max_depth = 15; presort = False; splitter = random
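
A sketch of a tree with the reported best parameters, fit on synthetic stand-in data; the presort option has since been removed from scikit-learn, so it is omitted, and random_state is added here for reproducibility:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 21 voice features
X, y = make_classification(n_samples=300, n_features=21, random_state=0)

# Reported best parameters (presort omitted; removed from scikit-learn)
tree = DecisionTreeClassifier(criterion="entropy", max_depth=15,
                              splitter="random", random_state=0)
tree.fit(X, y)

# On the real data, meanfun and Q25 together accounted for over 75%
# of the total feature importance
print(tree.feature_importances_.round(3))
```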

Confusion Matrix

Results

SVM

Support Vector Machine

Summary
Best Parameters
  • C = 5.9
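
A minimal sketch with the reported best C on synthetic stand-in data; the kernel is left at scikit-learn's default (RBF), which is an assumption here:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in for the voice features
X, y = make_classification(n_samples=300, n_features=21, random_state=0)

# Reported best parameter C = 5.9; kernel left at the default (RBF)
svm = SVC(C=5.9)
svm.fit(X, y)
print(svm.score(X, y))
```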

Confusion Matrix

Parameters

Results

Log Reg

Logistic Regression

Summary
Best Parameters
  • C = 0.2; penalty = l1
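
In scikit-learn the l1 penalty requires a solver that supports it; liblinear is used below as an assumption. A sketch with the reported best parameters on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the voice features
X, y = make_classification(n_samples=300, n_features=21, random_state=0)

# Reported best parameters; liblinear is a solver that supports l1
logreg = LogisticRegression(C=0.2, penalty="l1", solver="liblinear")
logreg.fit(X, y)

# l1 regularisation drives some coefficients exactly to zero
print((logreg.coef_ == 0).sum(), "of", logreg.coef_.size, "coefficients are zero")
```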

Confusion Matrix

Parameters

Results

Random Forest

Random Forest

Summary
  • Best overall accuracy using raw data
Best Parameters
  • criterion = entropy; max_depth = 14; n_estimators = 14
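
A sketch with the reported best parameters on synthetic stand-in data; random_state is added here for reproducibility:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the voice features
X, y = make_classification(n_samples=300, n_features=21, random_state=0)

# Reported best parameters: an ensemble of 14 entropy-based trees
rf = RandomForestClassifier(criterion="entropy", max_depth=14,
                            n_estimators=14, random_state=0)
rf.fit(X, y)
print(rf.score(X, y))
```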

Confusion Matrix

Parameters

Results

KNN (PCA)

K-Nearest Neighbors (PCA)

Summary
Best Parameters
  • algorithm = auto; n_neighbors = 3; p = 1; weights = distance
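
A sketch of the PCA variant as a scikit-learn pipeline on synthetic stand-in data; scaling before PCA and the component count are assumptions, while the KNN settings are the reported best parameters:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the voice features
X, y = make_classification(n_samples=300, n_features=21, random_state=0)

model = make_pipeline(
    StandardScaler(),                 # scaling before PCA is assumed here
    PCA(n_components=5),              # component count is illustrative
    KNeighborsClassifier(n_neighbors=3, p=1, weights="distance"))
model.fit(X, y)
print(model.score(X, y))
```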

Confusion Matrix

Results

SVM (PCA)

Support Vector Machine (PCA)

Summary
Best Parameters
  • C = 9.5

Confusion Matrix

Parameters

Results

Log Reg (PCA)

Logistic Regression (PCA)

Summary
Best Parameters
  • C = 0.35; fit_intercept = False; penalty = l1

Confusion Matrix

Parameters

Results

Conclusions

Criteria


Comparing the models across these criteria, female voices were consistently harder to classify correctly; male voices were easier.

ROC


The initial models performed poorly; transforming the data with PCA and trying alternative algorithms improved performance.

Findings